Online Evolving Clustering of Web Documents

نویسندگان

  • Anthony Evans
  • Plamen Angelov
  • Xiaowei Zhou
چکیده

In this paper an approach that is using evolving, incremental (on-line) clustering to automatically group relevant Web-based documents is proposed. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from the Nature in the sense that it is evolution-inspired. That is, the structure of the clusters and their number is not predefined, but it self-develops, they grow and shrink when new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search term query but are not presented in order of contextual similarity. For example, search terms that have more than one meaning such as "cold" are presented to the user in a list containing documents relating to the Cold War and the common cold. If these results could be clustered “on the fly” then this improved presentation of results would allow the end user to find relevant documents more easily by requiring the inspection of one cluster of contextually similar documents rather than entire list of documents containing information pertaining to irrelevant contexts. An issue that is paid a special attention to is the similarity measure between the textual documents. Euclidean, Levenstein, and Cosine similarity measures have been used with cosine dissimilarity/distance performing best and addressing the problem of different number of features in each document. The proposed evolving classifier has also learning capability – it improves the result on-line with any new document that has been accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that is going on for more than two years at Lancaster University on development of a novel clustering method that is suitable for real-time implementations. It is based on evolution principles and tries to address the limitations of existing clustering algorithms which cannot cope in an online mode with high dimensional datasets. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values which describes the fitness of a new sample (web document) to be the prototype of a new cluster without the need to store each previously encountered documents but taking into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evolving Better Stoplists for Document Clustering and Web Intelligence

Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since such areas underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such document analysis tasks is a list of so-called ‘stop’ words, called a ‘stoplist’. A stoplist is a specific collection of so-called ‘noise’ word...

متن کامل

Web Document Clustering based on Document Structure

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first one is the web document structure, which is currently ...

متن کامل

Incorporating Hyperlink Analysis in Web Page Clustering

The size of the World Wide Web is growing rapidly and it has become a very important source of information that can be useful to various academic and commercial applications. However, because of the large number of documents online, it is becoming increasingly difficult to search for useful information on the Web. General-purpose Web search engines, such as Google and AltaVista, present search ...

متن کامل

Hierarchical Clustering of documents-A brief study and implementation in MATLAB

The paper discusses and implements hierarchical clustering of documents. The objective is to group similar documents together using hierarchical clustering methods. The paper aims at organizing a set of documents into clusters. The paper is focused on Web Content mining by clustering web documents. Clustering is done on document corpus in MATLAB environment. The result is groups or clusters of ...

متن کامل

Clustering-Based Searching and Navigation in an Online News Source

The growing amount of online news posted on the WWW demands new algorithms that support topic detection, search, and navigation of news documents. This work presents an algorithm for topic detection that considers the temporal evolution of news and the structure of web documents. Then, it uses the results of the topic detection algorithm for searching and navigating in an online news source. An...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006